Abstract: A sinusoidal model for the speech waveform is used to develop
a new analysis/synthesis technique that is characterized by the
amplitudes, frequencies, and phases of the component sine waves. 
These parameters are estimated from the short-time Fourier transform 
using a simple peak-picking algorithm. Rapid changes in the highly 
resolved spectral components are tracked using the concept of “birth” 
and “death” of the underlying sine waves. For a given frequency track 
a cubic function is used to unwrap and interpolate the phase such that 
the phase track is maximally smooth. This phase function is applied to 
a sine-wave generator, which is amplitude modulated and added to the 
other sine waves to give the final speech output. The resulting synthetic 
waveform preserves the general waveform shape and is essentially per- 
ceptually indistinguishable from the original speech. Furthermore, in 
the presence of noise the perceptual characteristics of the speech as 
well as the noise are maintained. In addition, it was found that the 
representation was sufficiently general that high-quality reproduction 
was obtained for a larger class of inputs including: two overlapping, 
superposed speech waveforms; music waveforms; speech in musical 
backgrounds; and certain marine biologic sounds. 

Finally, the analysis/synthesis system forms the basis for new ap- 
proaches to the problems of speech transformations including time- 
scale and pitch-scale modification, and midrate speech coding [8], [9]. 

I. Introduction 

One approach to the problem of representation of
speech signals is to use the speech production model 
in which speech is viewed as the result of passing a glottal 
excitation waveform through a time-varying linear filter 
that models the resonant characteristics of the vocal tract. 
In many speech applications it suffices to assume that the 
glottal excitation can be in one of two possible states, cor- 
responding to voiced or unvoiced speech. In attempts to 
design high-quality speech coders at the midband rates, 
generalizations of the binary excitation model have been 
developed. One such approach that is currently popular is 
multipulse [1]. In this paper the goal is also to generalize 
the model for the glottal excitation; but instead of using 
impulses as in multipulse, the excitation waveform is as- 
sumed to be composed of sinusoidal components of arbi- 
trary amplitudes, frequencies, and phases. 

A number of other approaches to analysis/synthesis that 
are based on sine-wave models have been discussed in the 
literature. Hedelin [3] proposed a pitch-independent sine-wave
model for use in coding the baseband signal for speech
compression. The amplitudes and frequencies of the underlying
sine waves are estimated using Kalman filtering techniques,
and each sine-wave phase is defined to be the integral of the
associated instantaneous frequency. Another sine-wave-based
speech compression system is being developed by Almeida and
Silva [4]. In contrast to Hedelin's approach, their system uses
a pitch estimate to establish a harmonic set of sine waves. The
sine-wave phases are computed at the harmonic frequencies. To
compensate for any errors that might be introduced as a result
of the harmonic sine-wave representation, a residual waveform
is coded along with the underlying sine-wave parameters.

Manuscript received April 1, 1985; revised January 10, 1986. This work
was supported by the U.S. Department of the Air Force. The views
expressed are those of the authors and do not reflect the official policy or
position of the U.S. Government.

The authors are with the Lincoln Laboratory, Massachusetts Institute of
Technology, Lexington, MA 02173-0073.

IEEE Log Number 8608125.

In this paper a sinusoidal model for the speech wave- 
form is derived that leads to a new analysis/synthesis 
technique that is characterized by the amplitudes, fre- 
quencies, and phases of the component sine waves. In 
Section II the glottal excitation is represented in terms of 
a sum of sine waves, which, when applied to a time-varying
vocal tract filter, leads to the desired sinusoidal
representation for speech waveforms. In Section III a parameter
extraction algorithm is developed that shows that the
amplitudes, frequencies, and phases of the sine waves can 
be obtained from the high-resolution short-time Fourier 
transform (STFT) by locating the peaks of the associated 
magnitude function. In order to perform speech synthesis 
the amplitudes, frequencies, and phases estimated on one 
frame must be matched and allowed to continuously 
evolve into the set of amplitudes, frequencies, and phases 
estimated on a successive frame. These issues are re- 
solved in Sections IV and V where a frequency-matching 
algorithm is derived along with a solution to the phase 
unwrapping and phase interpolation problem. Experi- 
ments were performed with the resulting system, and the 
synthetic speech was judged to be of excellent quality, 
almost indistinguishable from the original. The results of 
some of these experiments are discussed in Section VI 
where pictorial comparisons of the original and synthetic 
waveforms are made. In addition, it has been found that 
the performance of the analysis/synthesis system did not 
degrade in the presence of environmental disturbances due 
to noise, multiple speakers, or music, and could be used 
to successfully reproduce certain marine biologic sounds. 

II. The Sinusoidal Speech Model 

In the speech production model, the speech waveform
s(t) is assumed to be the output of passing a glottal
excitation waveform e(t) through a linear time-varying filter
that models the characteristics of the vocal tract. If the
time-varying impulse response of the vocal tract filter is
$h(\tau; t)$, then

$$s(t) = \int_0^t h(t - \tau; t)\, e(\tau)\, d\tau. \tag{1}$$

As an alternative to the binary voiced/unvoiced excitation 
model and to the more general multipulse model, it is pro- 
posed that the excitation signal be represented in terms of 
a sum of sine waves of arbitrary amplitudes, frequencies, 
and phases. This model is written as 

$$e(t) = \mathrm{Re} \sum_{l=1}^{L(t)} a_l(t) \exp\left\{ j\left[ \int_0^t \omega_l(\sigma)\, d\sigma + \phi_l \right] \right\} \tag{2}$$

where, for the lth sinusoidal component, $a_l(t)$ and $\omega_l(t)$
represent the amplitude and frequency and $\phi_l$ represents a
fixed phase offset which accounts for the fact that the sine
waves will generally not be in phase. This model leads to
a particularly simple representation for the speech waveform.
That this is so becomes apparent by letting

$$H(\omega; t) = M(\omega; t) \exp[j\Phi(\omega; t)] \tag{3}$$

represent the time-varying vocal tract transfer function,
and, assuming that the glottal excitation parameters in (2)
are constant over the duration of the impulse response of
the vocal tract filter in effect at time t, then using (2) and
(3) in (1) results in the speech model¹

$$s(t) = \sum_{l=1}^{L(t)} a_l(t)\, M[\omega_l(t); t] \exp\left\{ j\left[ \int_0^t \omega_l(\sigma)\, d\sigma + \Phi[\omega_l(t); t] + \phi_l \right] \right\}. \tag{4}$$

By combining the effects of the glottal and vocal tract
amplitudes and phases, the representation can be written
more concisely as

$$s(t) = \sum_{l=1}^{L(t)} A_l(t) \exp[j\psi_l(t)] \tag{5}$$

where

$$A_l(t) = a_l(t)\, M[\omega_l(t); t] \tag{6}$$

$$\psi_l(t) = \int_0^t \omega_l(\sigma)\, d\sigma + \Phi[\omega_l(t); t] + \phi_l \tag{7}$$

represent the amplitude and phase of the lth sine wave
along the frequency track $\omega_l(t)$. The next step is to
develop a robust procedure for extracting the amplitudes,
frequencies, and phases of the component sine waves, a
subject which will be discussed in the next section.
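
As a small illustration of the representation in (5), the following sketch evaluates the real part of a sum of sine waves whose amplitudes and frequencies are held constant, so that the phase integral in (7) reduces to a linear term. The function name `synthesize`, the tuple layout, and the use of frequencies in hertz are assumptions made for this example; they do not come from the paper.

```python
import cmath
import math


def synthesize(components, num_samples, sample_rate):
    """Evaluate s(t) = Re sum_l A_l exp(j psi_l(t)) on a uniform time grid.

    components: list of (amplitude, frequency_hz, phase_offset_rad) tuples.
    Each component is assumed constant in time, so
    psi_l(t) = 2*pi*f_l*t + offset_l (a special case of (7)).
    """
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        total = 0j
        for amp, freq_hz, offset in components:
            psi = 2.0 * math.pi * freq_hz * t + offset
            total += amp * cmath.exp(1j * psi)
        out.append(total.real)  # restore the "Re" omitted in (4) and (5)
    return out
```

With time-varying parameters the same loop applies, except that the phase of each component has to be accumulated from its frequency track rather than computed in closed form.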

III. Estimation of Speech Parameters

The problem in analysis/synthesis is to take a speech 
waveform, extract the parameters that represent a quasi- 

¹ The “real part” notation “Re” has been temporarily omitted.


stationary portion of that waveform, and use those param- 
eters or coded versions of them to reconstruct an approx- 
imation that is “as close as possible” to the original 
speech. Furthermore, it is desirable to have a robust pa- 
rameter extraction algorithm since the speech signal in 
many cases is contaminated by additive acoustic noise. 
The general identification problem in which the speech 
signal is to be represented by multiple sine waves is a 
difficult one to solve analytically. Therefore, the approach 
taken here will be pragmatic, in the sense that an esti- 
mator will be derived based on a set of idealized assump- 
tions; then, once the structure of the ideal estimator is 
known, modifications will be made as the assumptions are 
relaxed to better model practical speech waveforms. 

As a first step, the time line will be broken down into 
a contiguous sequence of frames, each of duration T. The 
center of the analysis window for the kth frame occurs at 
time $t_k$. Assuming that the vocal tract and glottal parameters
are constant over an interval of time that includes
the duration of the analysis window and the duration of
the vocal tract impulse response, then (7) can be written
as

$$\psi_l(t) = \omega_l^k (t - t_k) + \theta_l^k \tag{8}$$

where the superscript k is used to indicate that the parameters
of the model may vary from frame to frame. As
a consequence of (8) the synthetic speech waveform over
frame k can be written as

$$s(n) = \sum_{l=1}^{L^k} \gamma_l^k \exp(jn\omega_l^k) \tag{9}$$

where $\gamma_l^k = A_l^k \exp(j\theta_l^k)$ represents the complex
amplitude for the lth component of the $L^k$ sine waves. Since
the measurements are made on digitized speech, the sampled-data
notation is used throughout this section. In this
respect, the time index n corresponds to the uniform samples
of $t - t_k$ so that n ranges from $-N/2$ to $N/2$, with
n = 0 being reset to the center of the analysis window for
every frame and where N + 1 is the duration of the analysis
window. The problem now is to fit the synthetic
speech waveform in (9) to the measured waveform, denoted
by y(n). A useful criterion for judging the goodness
of fit is the mean-squared error,

$$\epsilon^k = \sum_n |y(n) - s(n)|^2 = \sum_n |y(n)|^2 - 2\,\mathrm{Re} \sum_n y(n)\, s^*(n) + \sum_n |s(n)|^2. \tag{10}$$

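Substituting the speech model of (9) into (10) leads to the
error expression

$$\epsilon^k = \sum_n |y(n)|^2 - 2\,\mathrm{Re} \sum_{l=1}^{L^k} (\gamma_l^k)^* \sum_n y(n) \exp(-jn\omega_l^k) + (N+1) \sum_{l=1}^{L^k} \sum_{i=1}^{L^k} \gamma_l^k (\gamma_i^k)^*\, \mathrm{sinc}(\omega_l^k - \omega_i^k) \tag{11}$$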

where $\mathrm{sinc}(x) = \sin[(N+1)x/2] \,/\, [(N+1)\sin(x/2)]$.
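
The sinc(·) function in (11) comes from summing complex exponentials over the N + 1 window samples. The sketch below checks that identity numerically and also shows the orthogonality that is exploited for harmonic frequencies when the window length is a multiple of the pitch period. It assumes N is even (n runs from -N/2 to N/2); the function names are chosen for this example only.

```python
import cmath
import math


def sinc_periodic(x, N):
    """sinc(x) = sin[(N+1)x/2] / [(N+1) sin(x/2)] as defined after (11).

    For even N the limiting value at x = 0 (mod 2*pi) is 1.
    """
    s = math.sin(x / 2.0)
    if abs(s) < 1e-12:
        return 1.0
    return math.sin((N + 1) * x / 2.0) / ((N + 1) * s)


def exponential_average(x, N):
    """(1/(N+1)) * sum_{n=-N/2}^{N/2} exp(j n x), computed term by term."""
    half = N // 2
    total = sum(cmath.exp(1j * n * x) for n in range(-half, half + 1))
    return total / (N + 1)
```

For example, with a pitch period of 5 samples and a window of N + 1 = 15 samples, `sinc_periodic(d * 2 * math.pi / 5, 14)` is zero to within rounding error for d = 1, ..., 4, so distinct harmonics inside the band do not interact in the last term of (11).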
The problem now is to try to identify a set of sine waves
that minimizes (11), an identification problem that is, in
general, difficult to solve. Insights into the development
of a suitable estimator can be obtained by restricting the
class of input signals to perfectly voiced speech, in which
case (9) can be written as

$$s(n) = \sum_{l=1}^{L^k} \gamma_l^k \exp(jnl\omega_0^k) \tag{12}$$

where $\omega_0^k = 2\pi/\tau_0^k$ and where $\tau_0^k$ is the pitch period
assumed to be constant over the duration of the kth frame.
For the purpose of establishing the structure of the ideal
estimator, it is further assumed that the pitch period is
known and that the width of the analysis window is a multiple
of $\tau_0^k$. Under these highly idealized conditions, the
sinc(·) function in the last term of (11) reduces to

$$\mathrm{sinc}(\omega_l^k - \omega_i^k) = \mathrm{sinc}[(l - i)\omega_0^k] = \begin{cases} 1 & \text{if } l = i \\ 0 & \text{if } l \neq i \end{cases} \tag{13}$$

where $\omega_l^k = l\omega_0^k$. Then the error expression reduces to


$$\epsilon^k = \sum_n |y(n)|^2 - 2(N+1)\,\mathrm{Re} \sum_{l=1}^{L^k} (\gamma_l^k)^*\, Y(l\omega_0^k) + (N+1) \sum_{l=1}^{L^k} |\gamma_l^k|^2 \tag{14}$$

where

$$Y(\omega) = \frac{1}{N+1} \sum_n y(n) \exp(-jn\omega) \tag{15}$$

is the STFT of the measurement signal. By completing
the square in (14), the error can be written as

$$\epsilon^k = \sum_n |y(n)|^2 + (N+1) \sum_{l=1}^{L^k} \left[ |Y(l\omega_0^k) - \gamma_l^k|^2 - |Y(l\omega_0^k)|^2 \right] \tag{16}$$

from which it follows that the optimum estimate for the
amplitude and phase is

$$\hat{\gamma}_l^k = Y(l\omega_0^k), \tag{17}$$

which reduces the error to

$$\epsilon^k = \sum_n |y(n)|^2 - (N+1) \sum_{l=1}^{L^k} |Y(l\omega_0^k)|^2. \tag{18}$$


From this it follows that the error is minimized by selecting
all of the harmonic frequencies in the speech bandwidth
$\Omega$ (i.e., $L^k = \Omega/\omega_0^k$).

Equations (15), (17), and (18) completely specify the
structure of the ideal estimator and show that the speech
data are manifest in the optimum estimator through the
DFT. Although these results are equivalent to a Fourier
series representation of a periodic waveform, the above
equation leads to an intuitive generalization to the practical
case. This is done by considering the function
$|Y(\omega)|^2$ to be a continuous function of $\omega$. For the idealized
voiced speech case, this function (referred to as the
periodogram) will be pulselike in nature, with peaks occurring
at all of the pitch harmonics. Therefore, the frequencies
of the underlying sine waves correspond to the
location of the peaks of the periodogram, and the estimates
of the amplitudes and phases are obtained by evaluating
the STFT at the frequencies of the peaks. The advantage
of this latter interpretation of the estimator
structure is that it can be applied when the ideal voiced
speech assumption is no longer valid. That this is so can
be seen by calculating the STFT for the general sinusoidal
speech model in (9). In this case the STFT is simply

$$Y(\omega) = \sum_{l=1}^{L^k} \gamma_l^k\, \mathrm{sinc}(\omega_l^k - \omega). \tag{19}$$

Provided the analysis window is “wide enough” that

$$|\omega_l^k - \omega_i^k| \geq \frac{4\pi}{N+1}, \quad l \neq i, \tag{20}$$

then the periodogram can be written as

$$|Y(\omega)|^2 \approx \sum_{l=1}^{L^k} |\gamma_l^k|^2\, \mathrm{sinc}^2(\omega_l^k - \omega), \tag{21}$$

and as before, the location of the peaks of the periodogram
corresponds to the underlying sine-wave frequencies
and the STFT samples at these frequencies correspond
to the complex amplitudes. Therefore, the structure
of the ideal estimator applies to a more general class of
speech waveforms provided (20) holds. Since, during
steady voicing, neighboring frequencies are separated by
the pitch fundamental, (20) suggests that the desired resolution
can be achieved “most of the time” by requiring
that the analysis window be at least two pitch periods
wide. Of course, these properties are based on the assumption
that the sinc(·) function is essentially zero outside
of the region defined by (20). In fact, this is not a
valid approximation and there will be sidelobes outside of
this region which will lead to leakage that will compromise
the performance of the estimator. These sidelobes
are due to the rectangular window that is implicit in the
definition of the STFT, a problem which is reduced but
not eliminated by using the weighted STFT. Letting
$\tilde{Y}(\omega)$ denote the weighted STFT, i.e.,

$$\tilde{Y}(\omega) = \sum_{n=-N/2}^{N/2} w(n)\, y(n) \exp(-jn\omega) \tag{22}$$

where w(n) represents the temporal weighting due to the
window function, then the practical version of the idealized
estimator estimates the frequencies of the underlying
sine waves as the locations of the peaks of $|\tilde{Y}(\omega)|$ (i.e.,
the frequency at which the slope changes from positive to
negative). Letting these frequency estimates be denoted
by $\hat{\omega}_l^k$, then the corresponding complex amplitudes are
given by
$$\hat{\gamma}_l^k = \tilde{Y}(\hat{\omega}_l^k) = \hat{A}_l^k \exp(j\hat{\theta}_l^k). \tag{23}$$

Assuming that the component sine waves have been properly
resolved, then, in the absence of noise, $\hat{A}_l^k$ will yield
the amplitude of an underlying sine wave provided the window
is scaled so that

$$\sum_{n=-N/2}^{N/2} w(n) = 1. \tag{24}$$

Fig. 1. Typical periodogram for a frame of voiced speech and the
amplitude and frequency estimates of the underlying sine waves.

The Hamming window was used in all of the experi- 
ments reported in this paper, and while this resulted in a 
very good sidelobe structure, it did so at the expense of 
broadening the mainlobes of the periodogram estimator. 
Therefore, in order to maintain the resolution properties 
that were needed to justify the optimality properties of the 
periodogram processor, the constraint implied by (20) is 
revised to require that the window width be at least 2.5
times the pitch period. Although the window width could
be set on the basis of the instantaneous pitch, it is adequate
to adapt it to the average pitch, as this makes the
analyzer less sensitive to the performance of the pitch
extractor. When adjusting the analysis window, the average
pitch and the window width are continually being updated 
in real time using the pitch computed during strongly 
voiced frames and are averaged using a \ s time constant. 
During frames of unvoiced speech, the window is held 
fixed at the value obtained on the preceding voiced frame. 
Once the width for a particular frame has been specified, 
the Hamming window is computed, normalized according 
to (24), and the STFT of the input speech is taken using 
a 512-point FFT. Plotted in Fig. 1 is a typical periodo- 
gram for voiced speech, along with the amplitudes and 
frequencies that are estimated using the above procedure. 
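
The estimation procedure described above (normalized Hamming window, weighted STFT, peak picking on the magnitude) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it evaluates (22) by direct summation instead of an FFT, it takes the peak at the nearest frequency bin without any refinement, and the function names and the 512-bin default are choices made for this example. As an indication of sizes, at a 10 kHz sampling rate an average pitch of 100 Hz gives a pitch period of 100 samples, so a window of 2.5 pitch periods spans about 250 samples.

```python
import cmath
import math


def normalized_hamming(length):
    """Hamming window scaled so that its samples sum to 1, as required by (24)."""
    raw = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
           for i in range(length)]
    total = sum(raw)
    return [v / total for v in raw]


def weighted_stft(frame, num_bins=512):
    """Evaluate (22) at omega = 2*pi*b/num_bins for b = 0 .. num_bins//2.

    `frame` holds y(n) for n = -N/2 .. N/2, so its length N + 1 must be odd.
    A direct sum is used for clarity; an FFT would be used in practice.
    """
    half = (len(frame) - 1) // 2
    window = normalized_hamming(len(frame))
    spectrum = []
    for b in range(num_bins // 2 + 1):
        omega = 2.0 * math.pi * b / num_bins
        acc = 0j
        for i, sample in enumerate(frame):
            acc += window[i] * sample * cmath.exp(-1j * (i - half) * omega)
        spectrum.append(acc)
    return spectrum


def pick_peaks(spectrum, num_bins=512):
    """Return (omega, amplitude, phase) wherever the magnitude stops rising."""
    mag = [abs(v) for v in spectrum]
    peaks = []
    for b in range(1, len(mag) - 1):
        if mag[b - 1] < mag[b] >= mag[b + 1]:
            omega = 2.0 * math.pi * b / num_bins
            peaks.append((omega, mag[b], cmath.phase(spectrum[b])))
    return peaks
```

Note that window sidelobes also produce small local maxima, so `pick_peaks` generally returns more entries than there are true sine waves. For a real-valued input each sinusoid of amplitude A shows up with magnitude A/2 at positive frequency, because the sketch does not apply the “real part” convention of the model.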

The purpose of the preceding analysis was to produce 


an estimator structure that was closely related to the op- 
timal estimator derived on the basis of ideal voiced 
speech. The approximations that were introduced were 
based on properties that were more representative of re- 
alistic voiced speech. Nowhere have the properties of un- 
voiced speech been taken into account. To do this in an 
optimal way requires use of the Karhunen-Loeve expansion
for noiselike signals [2]. Such an analysis shows that
a sinusoidal representation is valid, provided the frequen- 
cies are “close enough” that the ensemble power spectral 
density changes slowly over consecutive frequencies. In 
order to apply the sinusoidal model to unvoiced speech, 
therefore, it is necessary to assume that the frequencies 
corresponding to the periodogram peaks will be “close 
enough” to satisfy the requirement imposed by the Kar- 
hunen-Loeve expansion. If the window width is con- 
strained to be at least 20 ms wide, then, “on the aver- 
age,” this will lead to a set of periodogram peaks that 
will be approximately 100 Hz apart, and this should pro- 
vide a sufficiently dense sampling to satisfy the con- 
straints of the Karhunen-Loeve sinusoidal representation 
for the unvoiced case. Plotted in Fig. 2 is a typical pe- 
riodogram for a frame of unvoiced speech along with the 
amplitudes and frequencies that are estimated using the 
above procedure. 

The above analysis provides a justification for the rep- 
resentation of the speech waveform in terms of the am- 
plitudes, frequencies, and phases of a set of sine waves. 
This representation applies to one analysis frame. Differ- 
ent sets of these parameters will be obtained for each 
frame. The next problem to address then is the association 
of amplitudes, frequencies, and phases measured on one 
frame with those that are obtained on a successive frame. 
This is the subject addressed in the next section. 
Fig. 2. Typical periodogram for a frame of unvoiced speech and the
amplitude and frequency estimates of the underlying sine waves.


IV. Frame-to-Frame Peak Matching 

If the number of peaks were constant from frame to 
frame, the problem of matching the parameters estimated 
on one frame with those on a successive frame would sim- 
ply require a frequency-ordered assignment of peaks. In 
practice, however, there will be spurious peaks that come 
and go due to the effects of sidelobe interaction; the lo- 
cations of the peaks will change as the pitch changes; and 
there will be rapid changes in both the location and the 
number of peaks corresponding to rapidly varying regions 
of speech, such as at voiced/unvoiced transitions. In order 
to account for such rapid movements in the spectral peaks, 
the concept of “birth” and “death” of sinusoidal com- 
ponents is introduced. The problem of matching spectral 
peaks in some “optimal” sense, while allowing for this 
birth-death process, is generally a difficult problem. One 
method, which has proven to be successful for signal re- 
construction, is now described. 

Suppose that somehow peaks up to frame k have been
matched and a new parameter set for frame k + 1 is generated.
Let the chosen frequencies on frames k and k + 1
be denoted by $\omega_0^k, \omega_1^k, \ldots, \omega_{N-1}^k$ and
$\omega_0^{k+1}, \omega_1^{k+1}, \ldots, \omega_{M-1}^{k+1}$, respectively, where for
convenience the “^” (hat) notation of the previous section has
been dropped, and where N and M represent the total number of
peaks selected on each frame ($N \neq M$ in general). The process
of matching each frequency in frame k, $\omega_n^k$, to some frequency
in frame k + 1, $\omega_m^{k+1}$, is given in the following three steps.

Step 1: Suppose that a match has been found for frequencies
$\omega_0^k, \omega_1^k, \ldots, \omega_{n-1}^k$. A match is now attempted
for frequency $\omega_n^k$. Fig. 3(a) depicts the case where all
frequencies $\omega_m^{k+1}$ in frame k + 1 lie outside a “matching
interval” $\Delta$ of $\omega_n^k$, i.e.,

$$|\omega_n^k - \omega_m^{k+1}| > \Delta \tag{25}$$

for all m. In this case the frequency track associated with
$\omega_n^k$ is declared “dead” on entering frame k + 1, and $\omega_n^k$
is matched to itself in frame k + 1, but with zero amplitude.
Frequency $\omega_n^k$ is then eliminated from further consideration,
and step 1 is repeated for the next frequency
in the list, $\omega_{n+1}^k$.

If, on the other hand, there exists a frequency $\omega_m^{k+1}$ in
frame k + 1 that lies within the matching interval about
$\omega_n^k$, and is the closest such frequency, i.e.,

$$|\omega_n^k - \omega_m^{k+1}| < |\omega_n^k - \omega_i^{k+1}| < \Delta \tag{26}$$

for all $i \neq m$, then $\omega_m^{k+1}$ is declared to be a candidate
match to $\omega_n^k$. A definitive match is not yet made since there
may exist a better match in frame k to the frequency
$\omega_m^{k+1}$, a contingency which is accounted for in step 2.

Step 2: In this step, a candidate match from step 1 is
confirmed. Suppose that a frequency $\omega_n^k$ of frame k has
been tentatively matched to frequency $\omega_m^{k+1}$ of frame
k + 1. Then, if $\omega_m^{k+1}$ has no better match to the remaining
unmatched frequencies of frame k, the candidate match is
declared to be a definitive match. This condition, illustrated
in Fig. 3(c), is given by

$$|\omega_m^{k+1} - \omega_n^k| < |\omega_m^{k+1} - \omega_i^k| \quad \text{for } i > n. \tag{27}$$

When this occurs, frequencies $\omega_n^k$ and $\omega_m^{k+1}$ are eliminated
from further consideration and step 1 is repeated for the
next frequency in the list, $\omega_{n+1}^k$.

If condition (27) is not satisfied, then the frequency
$\omega_m^{k+1}$ in frame k + 1 is better matched to the frequency
$\omega_{n+1}^k$ in frame k than it is to the test frequency $\omega_n^k$. Two
additional cases are then considered. In the first case,
illustrated in Fig. 3(d), the adjacent remaining lower frequency
$\omega_{m-1}^{k+1}$ (if one exists) lies below the matching interval,
hence, no match can be made. As a result, the
frequency track associated with $\omega_n^k$ is declared “dead” on
entering frame k + 1, and $\omega_n^k$ is matched to itself with
zero amplitude. In the second case, illustrated in Fig. 3(e),
the frequency $\omega_{m-1}^{k+1}$ is within the matching interval about
$\omega_n^k$, and a definitive match is made. After either case, step
1 is repeated using the next frequency in the list, $\omega_{n+1}^k$. It
should be noted that many other situations are possible in
this step, but to keep the tracker alternatives as simple as
possible, only the two cases discussed were implemented.

Step 3: When all frequencies of frame k have been
tested and assigned to continuing tracks or to dying tracks,
there may remain frequencies in frame k + 1 for which
no matches have been made. Suppose that $\omega_m^{k+1}$ is one
such frequency; then it is concluded that $\omega_m^{k+1}$ was “born”
in frame k, and its match, a new frequency $\omega_m^{k+1}$, is created
in frame k with zero magnitude. This is done for all
such unmatched frequencies. This last step is illustrated
in Fig. 3(f).

An illustration of the effects of the birth-death procedure
to account for extraneous peaks is shown in Fig. 4.
The results of applying the tracker to a segment of real 
speech are shown in Fig. 5, which demonstrates the ability 
of the tracker to adapt quickly through transitory speech 
behavior such as voiced/unvoiced transitions and mixed 
voiced/unvoiced regions. 
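
The three matching steps described in this section can be sketched in a few lines. The code below is one possible reading of the procedure, not the authors' implementation: it assumes both frequency lists are sorted in ascending order, it restricts the candidate search in step 1 to frequencies of frame k + 1 that are still unmatched, and it interprets the “adjacent remaining lower frequency” as the highest unmatched frequency below the contested one. The function name and the tuple encoding of tracks are illustrative.

```python
def match_frames(prev, cur, delta):
    """Match peak frequencies of frame k (prev) to frame k + 1 (cur).

    Returns a list of (n, m) index pairs:
      (n, m)    track continues from prev[n] to cur[m]
      (n, None) track dies (prev[n] is matched to itself, zero amplitude)
      (None, m) track is born (cur[m] gets a zero-amplitude partner)
    """
    used = [False] * len(cur)
    tracks = []
    for n, f in enumerate(prev):
        # Step 1: closest unmatched frequency inside the matching interval.
        candidates = [m for m in range(len(cur))
                      if not used[m] and abs(f - cur[m]) < delta]
        if not candidates:
            tracks.append((n, None))
            continue
        m = min(candidates, key=lambda j: abs(f - cur[j]))
        # Step 2: confirm unless a later frequency of frame k is closer, (27).
        d = abs(cur[m] - f)
        if all(d < abs(cur[m] - prev[i]) for i in range(n + 1, len(prev))):
            used[m] = True
            tracks.append((n, m))
            continue
        # Otherwise try the adjacent remaining lower frequency of frame k + 1.
        lower = [j for j in range(m) if not used[j]]
        if lower and abs(f - cur[lower[-1]]) < delta:
            used[lower[-1]] = True
            tracks.append((n, lower[-1]))
        else:
            tracks.append((n, None))
    # Step 3: every frequency of frame k + 1 left unmatched is a birth.
    tracks.extend((None, m) for m in range(len(cur)) if not used[m])
    return tracks
```

For instance, `match_frames([100.0, 200.0, 300.0], [105.0, 310.0, 400.0], 30.0)` continues the tracks at 100 and 300, lets the track at 200 die, and reports a birth at 400. The greedy, low-to-high ordering mirrors the description above; it keeps the tracker simple at the cost of not being a globally optimal assignment.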
by 11.5 and 23.0 ms, respectively. While the synthetic
speech produced by the first system was quite good, almost
indistinguishable from the original, the longer frame
interval resulted in synthetic speech that was “rough”
and, although very intelligible, was deemed to be of poor 
quality. Therefore, if a particular application can support 
a high frame rate, then the overlap-add synthesizer is a 
good system to use. However, there are many practical 
situations, such as midrate speech coding, where lower 
frame rates are necessary, for which an alternative to the 
overlap-add synthesizer must be developed. A method 
will now be described that interpolates the matched sine 
wave parameters directly. 

As a result of the frequency-matching algorithm described
in the previous section, all of the parameters measured
for an arbitrary frame k are associated with a corresponding
set of parameters for frame k + 1. Letting
$(\hat{A}_l^k, \hat{\omega}_l^k, \hat{\theta}_l^k)$ and $(\hat{A}_l^{k+1}, \hat{\omega}_l^{k+1}, \hat{\theta}_l^{k+1})$ denote the successive
sets of parameters for the lth frequency track, then an obvious
solution to the amplitude interpolation problem is
to take the linear interpolation

$$\tilde{A}(n) = \hat{A}^k + \frac{\hat{A}^{k+1} - \hat{A}^k}{S}\, n \tag{29}$$

where n = 0, 1, ..., S - 1 is the time sample into the
kth frame. (The track subscript “l” has been omitted for
convenience.)

Unfortunately, such a simple approach cannot be used
to interpolate the frequency and phase because the measured
phase $\hat{\theta}^k$ is obtained modulo $2\pi$. Hence, phase unwrapping
must be performed to ensure that the frequency
tracks are “maximally smooth” across frame boundaries.
The first step in solving this problem is to postulate a phase
interpolation function that is a cubic polynomial, namely,

$$\tilde{\theta}(t) = \zeta + \gamma t + \alpha t^2 + \beta t^3. \tag{30}$$

It is convenient to treat the phase function as though it
were a function of a continuous time variable t, with
t = 0 corresponding to frame k and t = T corresponding
to frame k + 1. Since the derivative of the phase is the
frequency, it is necessary that the cubic phase function 
and its derivative equal the phases and frequencies mea- 
sured at the frame boundaries. This idea of applying a 
cubic polynomial to interpolate the phase between frame 
boundaries was independently proposed by Almeida and 
Silva for use in their harmonic sine-wave synthesizer [4] . 
Since only the principal value of the phase can be mea- 
sured, provision must also be made for unwrapping the 
phase subject to the above constraints on the cubic phase 
interpolation function. In this paper an explicit solution is 
obtained for interpolation and phase unwrapping by in- 
voking an additional constraint which requires that the un- 
wrapped cubic phase function be “maximally smooth.” 
The mathematics leading to the complete solution are now 
presented. 

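Using the fact that the instantaneous frequency is the
derivative of the phase, then

$$\dot{\tilde{\theta}}(t) = \gamma + 2\alpha t + 3\beta t^2, \tag{31}$$

and it follows that at the starting point, t = 0,

$$\tilde{\theta}(0) = \zeta = \hat{\theta}^k, \qquad \dot{\tilde{\theta}}(0) = \gamma = \hat{\omega}^k \tag{32}$$

and at the terminal point, t = T,

$$\tilde{\theta}(T) = \hat{\theta}^k + \hat{\omega}^k T + \alpha T^2 + \beta T^3 = \hat{\theta}^{k+1} + 2\pi M, \qquad \dot{\tilde{\theta}}(T) = \hat{\omega}^k + 2\alpha T + 3\beta T^2 = \hat{\omega}^{k+1} \tag{33}$$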

where again the track subscript “l” is omitted for convenience.
Since the terminal phase $\hat{\theta}^{k+1}$ is measured modulo
$2\pi$, it is necessary to augment it by the term $2\pi M$ (M
is an integer) in order to make the resulting frequency
function “maximally smooth,” a concept that will be
quantified in the sequel. At this point M is unknown, but
for each value of M, whatever it may be, (33) can be
solved for $\alpha(M)$ and $\beta(M)$ (the dependence on M has now
been made explicit). The solution is easily shown to satisfy
the matrix equation

$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} \dfrac{3}{T^2} & -\dfrac{1}{T} \\ -\dfrac{2}{T^3} & \dfrac{1}{T^2} \end{bmatrix} \begin{bmatrix} \hat{\theta}^{k+1} - \hat{\theta}^k - \hat{\omega}^k T + 2\pi M \\ \hat{\omega}^{k+1} - \hat{\omega}^k \end{bmatrix}. \tag{34}$$

In order to determine M and ultimately the solution to the
phase unwrapping problem, an additional constraint needs
to be imposed that quantifies the “maximally smooth”
criterion. Fig. 6 illustrates a typical set of cubic phase
interpolation functions for a number of values of M. It
seems clear on intuitive grounds that the best phase function
to pick is the one that would have the least variation.
This is what is meant by a maximally smooth frequency
track. In fact, if the frequencies were constant and the
vocal tract were stationary, the true phase would be linear.
Therefore, a reasonable criterion for “smoothness”
is to choose M such that

$$f(M) = \int_0^T \left[ \ddot{\tilde{\theta}}(t; M) \right]^2 dt \tag{35}$$

is a minimum, where $\ddot{\tilde{\theta}}(t; M)$ denotes the second derivative
of $\tilde{\theta}(t; M)$ with respect to the time variable t.

Although M is integer valued, since f(M) is quadratic
in M, the problem is most easily solved by minimizing
f(x) with respect to the continuous variable x and then
choosing M to be the integer closest to x. After straightforward
but tedious algebra, it can be shown that the minimizing
value of x is

$$x^* = \frac{1}{2\pi} \left[ (\hat{\theta}^k + \hat{\omega}^k T - \hat{\theta}^{k+1}) + (\hat{\omega}^{k+1} - \hat{\omega}^k) \frac{T}{2} \right] \tag{36}$$


from which $M^*$ is determined and used in (34) to compute
$\alpha(M^*)$ and $\beta(M^*)$, and in turn, the unwrapped phase
interpolation function

$$\tilde{\theta}(t) = \hat{\theta}^k + \hat{\omega}^k t + \alpha(M^*) t^2 + \beta(M^*) t^3. \tag{37}$$

This phase function not only satisfies all of the measured
phase and frequency endpoint constraints, but also unwraps
the phase in such a way that $\tilde{\theta}(t)$ is maximally
smooth.

MCAULAY AND QUATIERI: SPEECH ANALYSIS/SYNTHESIS

Since the above analysis began with the assumption of
an initial unwrapped phase $\hat{\theta}^k$ corresponding to frequency
$\hat{\omega}^k$ at the start of frame k, it is necessary to specify the
initialization of the frame interpolation procedure. This is
done by noting that at some point in time the track under
study was born. When this event occurred, an amplitude,
frequency, and phase were measured at frame k + 1, and
the parameters at frame k to which these measurements
correspond were defined by setting the amplitude to zero
(i.e., $\hat{A}^k = 0$) while maintaining the same frequency (i.e.,
$\hat{\omega}^k = \hat{\omega}^{k+1}$). In order to ensure that the phase interpolation
constraints are satisfied initially, the unwrapped phase is
defined to be the measured phase $\hat{\theta}^{k+1}$ and the startup
phase is defined to be

$$\hat{\theta}^k = \hat{\theta}^{k+1} - \hat{\omega}^{k+1} S \tag{38}$$

where S is the number of samples traversed in going from
frame k + 1 back to frame k.
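
The phase unwrapping and interpolation of (34), (36), and (37), together with the startup phase of (38), can be sketched as follows. This is a minimal illustration under stated assumptions: frequencies are in radians per sample, the frame length T is expressed in samples (so T = S), and the function names and return layout are choices made for this example rather than anything specified in the paper.

```python
import math


def cubic_phase(theta_k, omega_k, theta_k1, omega_k1, T):
    """Return (M, alpha, beta, theta) for one track over one frame.

    M comes from rounding x* in (36); alpha and beta solve (34);
    theta(t) is the cubic interpolation function of (37).
    """
    two_pi = 2.0 * math.pi
    x = ((theta_k + omega_k * T - theta_k1)
         + (omega_k1 - omega_k) * T / 2.0) / two_pi
    M = round(x)  # integer closest to the continuous minimizer
    d_phase = theta_k1 - theta_k - omega_k * T + two_pi * M
    d_freq = omega_k1 - omega_k
    alpha = 3.0 * d_phase / T ** 2 - d_freq / T
    beta = -2.0 * d_phase / T ** 3 + d_freq / T ** 2

    def theta(t):
        return theta_k + omega_k * t + alpha * t ** 2 + beta * t ** 3

    return M, alpha, beta, theta


def startup_phase(theta_k1, omega_k1, S):
    """Phase assigned at frame k to a track born at frame k + 1, per (38)."""
    return theta_k1 - omega_k1 * S
```

When the frequency is constant across the frame and the measured phases are consistent with it, the quadratic and cubic coefficients vanish and the interpolated phase is linear, as the smoothness argument above suggests. The same holds for a newly born track initialized with `startup_phase`.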

As a result of the above phase unwrapping procedure,
each frequency track will have associated with it an
instantaneous unwrapped phase which accounts for both the
rapid phase changes due to the frequency of each sinusoidal
component and the slowly varying phase changes
due to the glottal pulse and the vocal tract transfer function.
Letting $\tilde{\theta}_l(t)$ denote the unwrapped phase function
for the lth track, then the final synthetic waveform will be
given by

$$s(n) = \sum_{l=1}^{L^k} \tilde{A}_l(n) \cos[\tilde{\theta}_l(n)] \tag{39}$$

where $\tilde{A}_l(n)$ is given by (29), $\tilde{\theta}_l(n)$ is the sampled data
version of (37), and $L^k$ is the number of sine waves estimated
for the kth frame.
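
Putting (29), (37), and (39) together, one frame of output can be generated as in the sketch below. It assumes that the matching and phase unwrapping have already been done, so each track arrives with its endpoint amplitudes, starting phase and frequency, and cubic coefficients; frequencies are in radians per sample and time is counted in samples. The function name and the tuple layout are hypothetical.

```python
import math


def synthesize_frame(tracks, S):
    """Sum amplitude-modulated cosines over the S samples of one frame.

    tracks: list of (a_k, a_k1, theta_k, omega_k, alpha, beta) tuples, where
    a_k and a_k1 are the amplitudes at frames k and k + 1, and the remaining
    entries define the cubic phase of (37) with t measured in samples.
    """
    out = [0.0] * S
    for a_k, a_k1, theta_k, omega_k, alpha, beta in tracks:
        for n in range(S):
            amp = a_k + (a_k1 - a_k) * n / S  # linear interpolation, (29)
            phase = theta_k + omega_k * n + alpha * n ** 2 + beta * n ** 3
            out[n] += amp * math.cos(phase)  # one term of (39)
    return out
```

A track that has just been born enters with `a_k = 0`, so it fades in from silence over the frame, and a dying track fades out in the same way with `a_k1 = 0`.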

This completes the theoretical basis for the new sinu- 
soidal analysis/synthesis system. Although extremely 
simple in concept, the detailed analysis led to the intro- 
duction of the birth-death frequency tracker and the cubic 
interpolation phase unwrapping procedure. The useful- 
ness with which these new procedures aid in the synthesis 
of speech will be discussed in the next section. 

VI. Experimental Results 
A block diagram description of the analysis/synthesis 
system is given in Fig. 7. A non-real-time floating-point
simulation was developed in order to determine the effectiveness
of the proposed approach in modeling real
speech. The speech processed in the simulation was low-pass
filtered at 5 kHz, digitized at 10 kHz, and analyzed
at 10 ms frame intervals. A 512-point FFT using a pitch-adaptive
Hamming window, having a width which was
2.5 times the average pitch period, was found to be sufficient for
accurate peak estimation. The maximum number of peaks
used in synthesis was set to a fixed number (~80),
and if excessive peaks were obtained only the largest
peaks were used. A large speech database has been pro- 
cessed with this system, and it has been found that the 
synthetic speech was perceived to be essentially indistin- 
guishable from the original. Visual examination of many 
of the reconstructed passages shows that the waveform
structure is essentially preserved. An example of this
property is shown in Fig. 8, which compares the waveforms
for the original speech and the reconstructed speech
during an unvoiced/voiced speech transition. This suggests
that the quasi-stationarity conditions seem to be satisfactorily
met and that the use of the parametric model
based on the amplitudes, frequencies, and phases of a set
of sine-wave components appears to be justifiable for both
voiced and unvoiced speech.

Fig. 7. Block diagram of the sinusoidal analysis/synthesis system.
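
The analysis settings quoted above (10 kHz sampling, a 512-point FFT, a Hamming window spanning 2.5 average pitch periods, and a cap of about 80 peaks) can be illustrated with the following sketch of a peak picker for a single frame. It is only an approximation under stated assumptions: the function names are hypothetical, the window is normalized to unit sum so that a peak magnitude is roughly half the sine-wave amplitude, and the phase is simply the STFT phase at the peak bin, referenced to the first sample of the frame.

```python
import numpy as np


def window_length(pitch_hz, fs=10_000, factor=2.5):
    """Window width in samples: `factor` times the average pitch period."""
    return int(round(factor * fs / pitch_hz))


def pick_peaks(frame, fs=10_000, nfft=512, max_peaks=80):
    """Return (amplitudes, frequencies in Hz, phases) of the STFT peaks.

    `frame` should already have the pitch-adaptive length; it is Hamming
    windowed and zero-padded to `nfft` points.
    """
    frame = np.asarray(frame, dtype=float)
    w = np.hamming(len(frame))
    w /= w.sum()                                  # unit-sum normalization (assumed)
    X = np.fft.rfft(frame * w, n=nfft)
    mag = np.abs(X)

    # A peak is a bin whose magnitude exceeds its left neighbor and is not
    # smaller than its right neighbor.
    interior = np.arange(1, len(mag) - 1)
    is_peak = (mag[interior] > mag[interior - 1]) & (mag[interior] >= mag[interior + 1])
    bins = interior[is_peak]

    # If too many peaks are found, keep only the largest ones.
    if len(bins) > max_peaks:
        bins = np.sort(bins[np.argsort(mag[bins])[-max_peaks:]])

    amps = 2.0 * mag[bins]                        # factor 2 for a real sinusoid
    freqs_hz = bins * fs / nfft
    phases = np.angle(X[bins])
    return amps, freqs_hz, phases
```

With these settings the frequency estimates are quantized to the FFT bin spacing of 10000/512, or about 19.5 Hz.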

Although the sinusoidal model was originally designed 
for a single speaker, it can represent any waveform con- 
sisting of a sum of sine waves with time-varying ampli- 
tudes and frequencies. Thus, the analysis/synthesis sys- 
tem should be capable of synthesizing a broader class of 
signals. This hypothesis was verified by successfully re- 
constructing multispeaker waveforms, music, speech in a 
musical background, and marine biologic signals such as 
whale sounds. Furthermore, it was found that the recon- 
struction does not break down in the presence of noise. 
The synthesized speech is perceptually nearly indistin- 
guishable from the original noisy speech with essentially 
no modification of the noise characteristics. Illustrations 
depicting the performance of the system in the face of the 
above degradations are provided in [10]. 

Although high-quality analysis/synthesis of speech has 
been demonstrated using the amplitudes, frequencies, and 
phases of the peaks of the high-resolution STFT, it is often 
argued that the ear is insensitive to phase, a proposition 
which forms the basis of much of the work in narrow- 
band speech coders. The question arises as to whether or 
not the phase measurements are essential to the sum of 
sine waves synthesis procedure. An attempt to explore this 
question was made by replacing each cubic phase track 
by a phase function that was defined to be the integral of 
the instantaneous frequency [3], [5]. In this case the instantaneous
frequency was taken to be the linear interpolation
of the frequencies measured at the frame boundaries,
and the integration, which started from a zero value
at the birth of the track, continued to be evaluated along
that track until that track died. This “magnitude-only”
reconstruction technique was applied to several sentences
of speech, and, while the resulting synthetic speech was
very intelligible and free of artifacts, it was perceived as
being different from the original speech. Furthermore, the
differences were more pronounced for low-pitched (i.e.,
pitch < ~100 Hz) speakers. An example of a waveform
synthesized by the “magnitude-only” system is shown in
Fig. 9(b). Compared to the original speech, shown in Fig.
9(a), the synthetic waveform is quite different owing to
the failure to maintain the true sine-wave phases. In an
additional experiment the magnitude-only system was applied
to the synthesis of noisy speech, and it was found
that the synthetic noise took on a tonal quality that was
unnatural and annoying.

Fig. 8. Sinusoidal reconstruction of speech.

Fig. 9. Magnitude-only reconstruction of speech.
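
The magnitude-only phase described above can be sketched as follows. This is an illustration under assumptions, not the code used in the experiments: the function name is hypothetical, frequencies are taken in radians per sample, and the integral of the linearly interpolated frequency is evaluated in closed form within each frame.

```python
import numpy as np


def magnitude_only_phase(omegas, S):
    """Phase of one track as the running integral of its frequency.

    omegas: frequencies (radians/sample) measured at successive frame
            boundaries, from the birth of the track to its death.
    S:      number of samples per frame.
    Returns the phase at every sample, S values per frame.
    """
    omegas = np.asarray(omegas, dtype=float)
    n = np.arange(S, dtype=float)
    start = 0.0                       # integration starts at zero at birth
    pieces = []
    for w0, w1 in zip(omegas[:-1], omegas[1:]):
        # Integral of the linear ramp from w0 to w1, evaluated at sample n.
        pieces.append(start + w0 * n + (w1 - w0) * n**2 / (2.0 * S))
        # Phase accumulated over the whole frame carries into the next one.
        start += 0.5 * (w0 + w1) * S
    return np.concatenate(pieces)
```

Since the measured phases never enter this computation, the synthetic sine waves keep the measured amplitudes and frequencies but drift away from the true phase relationships, which is consistent with the waveform differences seen in Fig. 9.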

VII. Conclusions 

A sinusoidal representation for the speech waveform has 
been developed that extracts the amplitudes, frequencies, 
and phases of the component sine waves from the STFT 
(short-time Fourier transform). The parameter extraction 
routine is robust in noise in the sense that the parameters 
are obtained by coherently processing the data over the 
analysis window. 

In order to account for spurious effects due to sidelobe 
interaction and time-varying voicing and vocal tract 
events, sine waves are allowed to come and go in accordance
with a birth-death frequency-tracking algorithm.
Once contiguous frequencies are matched, a smooth cubic 
phase interpolation function is obtained that is consistent 
with all of the frequency and phase measurements and is 
maximally smooth. This phase function is applied to a 
sine-wave generator, which is amplitude modulated and 
added to the other sinusoidal components to give the final 
speech output. 

The analysis/synthesis system was applied to clear 
speech and speech that was subjected to various types of 
interference. Synthetic speech that was natural and of high 
quality was generated in every case. The system could 
also be used to develop a parametric representation for 
nonspeech sounds such as music and certain marine biologic
sounds. Finally, it is important to note that, except
in updating the average pitch used to adjust the width of
the analysis window, no voicing decisions are used in the
analysis/synthesis procedure.

In some respects the basic model has similarities to one 
that has been proposed by Flanagan [6]. Flanagan argues
that because of the nature of the peripheral auditory sys- 
tem, a speech waveform can be expressed as the sum of 
the outputs of a fixed filter bank. The amplitude, fre- 
quency, and phase measurements of the filter outputs are 
then used in various configurations of speech synthesizers 
[7]. Although the present work is based on the discrete 
Fourier transform (DFT), which can be interpreted as a 
filter bank, the use of a high-resolution DFT in combi- 
nation with peak picking renders a highly adaptive filter 
bank since only a subset of all of the DFT filters is used 
at any one frame. It is the use of the frequency tracker 
and the cubic phase interpolator that allows the filter bank 
to move with the highly resolved speech components. 
Therefore, the system fits into the framework described 
by Flanagan, but, whereas Flanagan’s approach is based 
on the properties of the peripheral auditory system, the 
present system is designed on the basis of properties of 
the speech production mechanism. 

Attempts to perform “magnitude-only” reconstruction 
were made by replacing the cubic phase tracks by a phase 
that was simply the integral of the instantaneous fre- 
quency. While the resulting speech was very intelligible 
and free of artifacts, it was perceived as being different in 
quality from the original speech; the differences were 
more pronounced for low-pitched (i.e., pitch < ~100 Hz)
speakers. When the magnitude-only system was used to 
synthesize noisy speech, the synthetic noise took on a 
tonal quality that was unnatural and annoying. It was con- 
cluded that this latter property would render the system 
unsuitable for applications for which the speech would be 
subjected to additive acoustic noise. 

While it may be tempting to conclude that the ear is not 
phase deaf, particularly for low-pitched speakers, it may 
be that this is simply a property of the sinusoidal analysis/ 
synthesis system. No attempts were made to devise an 
experiment that would resolve this question conclusively. 
It was felt, however, that the system was well suited to 
the design and execution of such an experiment since it 
provides explicit access to a set of phase parameters that 
are essential to the high-quality reconstruction of speech. 

It is important to note that the use of the frequency 
tracker and the cubic phase interpolation function resulted 
in a functional description of the time evolution of the 
amplitude and phase of the sinusoidal components of the 
synthetic speech. For the applications for which the sys- 
tem was developed (i.e., time-scale, pitch-scale, and fre- 
quency modification of speech and speech coding) such a 
functional model is essential. It should be noted, how- 
ever, that if the system were to be applied simply to 
achieve synthesis using a set of sine waves, then the fre- 
quency tracking and phase interpolation procedures would 
be unnecessary. In this case a solution is achieved simply 
by overlapping and adding time-weighted segments of 
each of the sinusoidal components. The resulting synthetic
speech was found to be essentially perceptually indistinguishable
from the original speech, provided the
frame rate was on the order of 100 Hz.
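
The overlap-add alternative mentioned above can be sketched as follows. The source says only that time-weighted segments are overlapped and added, so the triangular weighting, the two-frame segment length, and the convention that each phase is referenced to the center of its frame are all assumptions made for this illustration, as are the function and variable names.

```python
import numpy as np


def overlap_add_synthesis(frames, S):
    """Overlap-add sine-wave synthesis without tracking or interpolation.

    frames: list of (amps, omegas, phases) per frame, with omegas in
            radians/sample and phases referenced to the frame center at
            sample k * S (assumed convention).
    S:      frame spacing in samples.
    Returns the samples between the first and last frame centers.
    """
    K = len(frames)
    out = np.zeros((K - 1) * S)
    n = np.arange(-S, S)                      # segment spans two frames
    window = 1.0 - np.abs(n) / S              # triangular time weighting
    for k, (amps, omegas, phases) in enumerate(frames):
        seg = np.zeros(2 * S)
        for A, w, th in zip(amps, omegas, phases):
            seg += A * np.cos(w * n + th)     # parameters held fixed in the segment
        seg *= window
        idx = k * S + n
        keep = (idx >= 0) & (idx < len(out))
        out[idx[keep]] += seg[keep]
    return out
```

Adjacent triangular windows sum to one, so a stationary sine wave with consistent phases is reproduced exactly. At a 10 kHz sampling rate, the 100 Hz frame rate quoted above corresponds to `S = 100`.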

Finally, it should be noted that a fixed-point 16-bit real- 
time implementation of the system has been developed on 
the Lincoln Digital Signal Processors [11]. Diagnostic 
rhyme tests (DRT) have been performed, and it has been 
found that about one DRT point is lost relative to the un- 
processed speech of the same bandwidth with the analy- 
sis/synthesis system operating at a 50 Hz frame rate. Cur- 
rently, the system is being used in research aimed at the 
development of a midrate speech coder and has already 
been applied successfully to problems in time-scale, pitch- 
scale, and frequency modification of speech. Preliminary 
results on the application of the sinusoidal-based system 
to the speech transformation and coding problems have 
been reported in [8] and [9].